No description has been provided for this image

Spatial Data Exploration¶

The exploration of data using maps is a key first step during the analytical process. However, this is not as easy as it may seem. Let me get some data organised by the CIA:

  • Tobacco Use
  • Alcohol Comsumption
  • Obesity in Adults
  • Population
  • Area

The data is on GitHub:

In [136]:
mainLink='https://github.com/DACSS-CSSmeths/Spatial-Exploring/raw/refs/heads/main/'

tobLink='dataCIA/Tobacco_use.csv'
alcLink='dataCIA/Alcohol_consumptionPerCapita.csv'
obeLink='dataCIA/Obesity_adultPrevalenceRate.csv'
popLink='dataCIA/Population_total.csv'
areaLink='dataCIA/Area.csv'

import pandas as pd

tobDat=pd.read_csv(mainLink+tobLink)
alcDat=pd.read_csv(mainLink+alcLink)
obeDat=pd.read_csv(mainLink+obeLink)
popDat=pd.read_csv(mainLink+popLink)
areaDat=pd.read_csv(mainLink+areaLink)


# num of rows
tobDat.shape[0],alcDat.shape[0], obeDat.shape[0],popDat.shape[0],areaDat.shape[0]
Out[136]:
(164, 189, 192, 237, 256)

As you see, the amount of countries (rows) in each data is different from the others.

The data merging¶

We need one data frame after merging all of them, the result has problematic column names:

In [137]:
allCia=obeDat.merge(tobDat.merge(alcDat,on='name'),on='name')
allCia.columns
Out[137]:
Index(['name', 'obesityAdults_rate', 'region', 'TobaccoUse_perc', 'region_x',
       'Alcohol_LitersPerCap', 'region_y'],
      dtype='object')

Notice that preprocesing concepts are always needed in any project. In this case:

  • name column appears once, it was the key column for the merging of the three data sets.
  • region was not a key column, but it appeared in all the data frames; so, the column is repeated with a suffix at the end (x,_y). Notice that _x refers to the location of the object related to the _merge() function:left.merge(right)

We will simply subset in this case:

In [138]:
# subsetting
allCia=allCia.loc[:,['name','region','obesityAdults_rate','TobaccoUse_perc','Alcohol_LitersPerCap']]

# check data types
allCia.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  162 non-null    object 
 1   region                162 non-null    object 
 2   obesityAdults_rate    162 non-null    float64
 3   TobaccoUse_perc       162 non-null    float64
 4   Alcohol_LitersPerCap  162 non-null    float64
dtypes: float64(3), object(2)
memory usage: 6.5+ KB

Notice that the resulting dataframe got even less countries than any of the other ones after the merge. You would need to apply fuzzy merge if you want to recover some rows during the merge. The good news is that the columns are in the rigth data type. Please review my previous work if these two concepts were not familiar to you.

Let me prepare another merge:

In [139]:
poparea=popDat.merge(areaDat, on=['name','region'])
poparea
Out[139]:
name Population_total region area_SqKm
0 China 1,416,043,270 East and Southeast Asia 9,596,960
1 India 1,409,128,296 South Asia 3,287,263
2 United States 341,963,408 North America 9,833,517
3 Indonesia 281,562,465 East and Southeast Asia 1,904,569
4 Pakistan 252,363,571 South Asia 796,095
... ... ... ... ...
232 Tokelau 1,647 Australia and Oceania 12
233 Paracel Islands 1,440 East and Southeast Asia 8
234 Holy See (Vatican City) 1,000 Europe 0
235 Cocos (Keeling) Islands 593 Australia and Oceania 14
236 Pitcairn Islands 50 Australia and Oceania 47

237 rows × 4 columns

Notice that I used TWO keys this time. This avoid the previous column suffixes. Let's see the data types:

In [140]:
poparea.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   name              237 non-null    object
 1   Population_total  237 non-null    object
 2   region            237 non-null    object
 3   area_SqKm         237 non-null    object
dtypes: object(4)
memory usage: 7.5+ KB

As you may have guessed, the numeric columns are not properly recognized as such. You may need to delete the commas and turn the columns into a numerical data type, or just reload the data with other arguments:

In [141]:
popDat=pd.read_csv(mainLink+popLink, thousands=',')
areaDat=pd.read_csv(mainLink+areaLink, thousands=',')

Now we have it right!

In [142]:
poparea=popDat.merge(areaDat, on=['name','region'])
poparea.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   name              237 non-null    object
 1   Population_total  237 non-null    int64 
 2   region            237 non-null    object
 3   area_SqKm         237 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 7.5+ KB

Geo Merging¶

Remember we created these maps the previous week:

In [143]:
import geopandas as gpd

mapsLink=mainLink+'maps/worldMaps_Py.gpkg'



# just the list of maps
gpd.list_layers(mapsLink)
Out[143]:
name geometry_type
0 countries_poly MultiPolygon
1 rivers_line MultiLineString
2 cities_point Point

Let's open the countries_poly map:

In [144]:
countries=gpd.read_file(mapsLink, layer='countries_poly')
countries.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   COUNTRY   252 non-null    object  
 1   geometry  252 non-null    geometry
dtypes: geometry(1), object(1)
memory usage: 4.1+ KB

Let's see some contents:

In [145]:
countries.head()
Out[145]:
COUNTRY geometry
0 Aruba (Netherlands) MULTIPOLYGON (((-69.88223 12.41111, -69.94695 ...
1 Antigua and Barbuda MULTIPOLYGON (((-61.73889 17.54055, -61.75195 ...
2 Afghanistan MULTIPOLYGON (((61.27656 35.60725, 61.29638 35...
3 Algeria MULTIPOLYGON (((-5.15213 30.18047, -5.13917 30...
4 Azerbaijan MULTIPOLYGON (((46.54037 38.87559, 46.49554 38...

This map has no interesting information beyond the geometry, which, as we know, serves to plot the polygons.

In [146]:
import matplotlib.pyplot as plt

countries.plot()
plt.show();
No description has been provided for this image

Then, let's add the population data to the map.

Notice that when geoMerging, the map object has to be the LEFT object: GEO_DF.merge(DF)

In [147]:
mapPopu=countries.merge(poparea, left_on='COUNTRY', right_on='name')
mapPopu.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 174 entries, 0 to 173
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   COUNTRY           174 non-null    object  
 1   geometry          174 non-null    geometry
 2   name              174 non-null    object  
 3   Population_total  174 non-null    int64   
 4   region            174 non-null    object  
 5   area_SqKm         174 non-null    int64   
dtypes: geometry(1), int64(2), object(3)
memory usage: 8.3+ KB

Notice that the merge has now less COUNTRIES. You should consider fuzzy merge to reduce missing values.

Let me color the polygons using the population data:

In [148]:
mapPopu.plot(column='Population_total',legend=True)
Out[148]:
<Axes: >
No description has been provided for this image

Is it that simple? Not really. Of course is not a difficult task. After you have read the recommended reading (Herries, 2018), you should know that population is a count, an example of extensive statistics, and we should process data so that they represent intensive stats when plotted. Let's compute population density, as density is an intensive stat:

In [149]:
mapPopu['popDensity']=mapPopu.Population_total/mapPopu.area_SqKm

Choropleths¶

Are we ready for a choropleth? the popDensity we just created is an intensive stat. Lets' see its distribution:

In [150]:
import seaborn as sea

sea.kdeplot(data=mapPopu, x="popDensity")
sea.rugplot(data=mapPopu, x="popDensity",height=.1)

plt.show()
No description has been provided for this image

Then, you know that outliers are present, let's sort the data:

In [151]:
mapPopu.sort_values(by=['popDensity'],ascending=False)
Out[151]:
COUNTRY geometry name Population_total region area_SqKm popDensity
97 Monaco MULTIPOLYGON (((7.39161 43.72755, 7.3909 43.74... Monaco 31813 Europe 2 15906.500000
139 Singapore MULTIPOLYGON (((103.99054 1.38329, 103.99795 1... Singapore 6028459 East and Southeast Asia 719 8384.504868
11 Bahrain MULTIPOLYGON (((50.65055 26.24472, 50.61305 26... Bahrain 1566888 Middle East 760 2061.694737
101 Malta MULTIPOLYGON (((14.51972 35.8, 14.42389 35.818... Malta 469730 Europe 316 1486.487342
103 Maldives MULTIPOLYGON (((72.97998 7.02778, 72.98272 7.0... Maldives 388858 South Asia 298 1304.892617
... ... ... ... ... ... ... ...
63 Guyana MULTIPOLYGON (((-60.08087 5.16151, -60.08195 5... Guyana 794099 South America 214969 3.694016
68 Iceland MULTIPOLYGON (((-22.02472 64.41888, -22.02444 ... Iceland 364036 Europe 103000 3.534330
9 Australia MULTIPOLYGON (((142.27997 -10.26556, 142.21053... Australia 26768598 Australia and Oceania 7741220 3.457930
168 Namibia MULTIPOLYGON (((14.52485 -22.69207, 14.5274 -2... Namibia 2803660 Africa 824292 3.401295
94 Mongolia MULTIPOLYGON (((116.6718 46.32724, 116.58554 4... Mongolia 3281676 East and Southeast Asia 1564116 2.098103

174 rows × 7 columns

Outliers will make the coloring challenging:

In [152]:
# find Monaco and Singapur!
mapPopu.plot(column='popDensity',figsize=(15,25))
Out[152]:
<Axes: >
No description has been provided for this image

Given this behavior, we could experiment with some transformation:

In [153]:
import numpy as np

#log of var
mapPopu['popDensity_log10']=np.log10(mapPopu.popDensity)
mapPopu.plot(column='popDensity_log10',figsize=(15,25))
Out[153]:
<Axes: >
No description has been provided for this image

As I shared at the begining, we need some thinking before coloring the polygons so that it can be an interesteting and useful choropleth.

A Choropleth for CONTINUOUS values¶

Let me use the other variables we have to create a new map:

In [154]:
countriesCIA=countries.merge(allCia, left_on='COUNTRY', right_on='name')
countriesCIA.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   COUNTRY               147 non-null    object  
 1   geometry              147 non-null    geometry
 2   name                  147 non-null    object  
 3   region                147 non-null    object  
 4   obesityAdults_rate    147 non-null    float64 
 5   TobaccoUse_perc       147 non-null    float64 
 6   Alcohol_LitersPerCap  147 non-null    float64 
dtypes: float64(3), geometry(1), object(3)
memory usage: 8.2+ KB

The merge gave us again a GeoDF, let's see the distribution of Tobacco use:

In [155]:
sea.kdeplot(data=countriesCIA, x="TobaccoUse_perc")
sea.rugplot(data=countriesCIA, x="TobaccoUse_perc",height=.1)

plt.show()
No description has been provided for this image

Now the map:

In [156]:
countriesCIA.plot(column='TobaccoUse_perc',legend=True)
plt.show();
No description has been provided for this image

The variable has already been transformed, and represent an intensive stat. But this is not the best we can do (review the recommended reading)::

  • We are allowing the program to choose the color palette.
  • We are not clarifying that we have missing values.
  • The axes coordinates may not be needed.
  • We have too many shades of colors. Discretization may be needed.

About color palettes¶

The default is viridis:

In [157]:
countriesCIA.plot(column='TobaccoUse_perc',legend=True,cmap='viridis')
plt.show();
No description has been provided for this image

Of course there are other palettes.

In [158]:
countriesCIA.plot(column='TobaccoUse_perc',legend=True,cmap='magma')
plt.show();
No description has been provided for this image

Remember to choose a palette that DOES NOT have red and green simultaneously, colorblinded people have trouble when both are present.

About missing values¶

You should inform what is missing. A good strategy is to create another map of only the borders:

In [159]:
countries.dissolve()
Out[159]:
geometry COUNTRY
0 MULTIPOLYGON (((-159.77335 -79.86084, -159.924... Aruba (Netherlands)

The cell 'Aruba' can be changed to 'world', but it is not really needed. Let me go ahead and create the new map, and use it a as base layer.

In [160]:
worldBorders=countries.dissolve() # saving the dissolved map

# plotting two layers:
base=worldBorders.plot(color='lightgrey') # layer 1
countriesCIA.plot(column='TobaccoUse_perc',legend=True,cmap='magma',
                  ax=base) # layer 2, on top on layer 1
base.set_title("Tobacco Use")
plt.show();
No description has been provided for this image

About axes coordinates¶

Coordinates are important, but in some cases they can be omitted. You need to evaluate that on a case by case basis.

In [161]:
base=worldBorders.plot(color='lightgrey')
countriesCIA.plot(column='TobaccoUse_perc',
                  legend=True,
                  cmap='magma',
                  ax=base)
# here!
base.set_title("Tobacco Use - (non discretized)")
base.set_axis_off()


plt.show();
No description has been provided for this image

I will share the code to create this last map in R too.

A Choropleth for DISCRETIZED values¶

Discretizing is about cutting the variable in intervals. This is a usual operation, which is used by histogramas to compute the width of each bin:

In [162]:
countriesCIA.loc[:,'TobaccoUse_perc'].hist()  
plt.show();
No description has been provided for this image

For histograms, 10 bins is the default, and generally that is a good idea; but in maps we should try 3 or 5.

The bining schemes¶

Let me compute 5 bins using different schemes:

In [163]:
## make sure you have
# !pip show pysal mapclassify numpy
In [164]:
import mapclassify 
import numpy as np

np.random.seed(12345) # so we all get the same results!

# let's try 5 bins
K=5
theVar=countriesCIA.TobaccoUse_perc


# same interval width, easy interpretation
ei5 = mapclassify.EqualInterval(theVar, k=K)
# same interval width based on standard deviation, easy - but not as the previous one, poor when high skewness
msd = mapclassify.StdMean(theVar)
# interval width varies, counts per interval are close, not easy to grasp, repeated values complicate cuts                                
q5=mapclassify.Quantiles(theVar,k=K)
# based on similarity, good for multimodal data 
mb5 = mapclassify.MaximumBreaks(theVar, k=K)
# based on similarity, optimizer
fj5 = mapclassify.FisherJenks(theVar, k=K)
# based on similarity, optimizer
jc5 = mapclassify.JenksCaspall(theVar, k=K)
# based on similarity, optimizer
mp5 = mapclassify.MaxP(theVar, k=K) 

###### based on similarity, good for skewed data
ht = mapclassify.HeadTailBreaks(theVar) # no K needed
/Users/JoseManuel/opt/anaconda3/envs/CSS_meth_3_13/lib/python3.13/site-packages/IPython/core/interactiveshell.py:3577: UserWarning: Numba not installed. Using slow pure python version.
  exec(code_obj, self.user_global_ns, self.user_ns)

How can we select the right classification? Let me use the the Absolute deviation around class median (ADCM) to make the comparisson:

In [165]:
class5 = ei5,msd, q5,mb5,  ht, fj5, jc5, mp5
# Collect ADCM for each classifier
fits = np.array([ c.adcm for c in class5])
# Convert ADCM scores to a DataFrame
adcms = pd.DataFrame(fits)
# Add classifier names
adcms['classifier'] = [c.name for c in class5]
# Add column names to the ADCM
adcms.columns = ['ADCM', 'Classifier']

Now, plot the adcms:

In [166]:
adcms.sort_values('ADCM').plot.barh(x='Classifier')
plt.show();
No description has been provided for this image

As you see, the best scheme is 'Fisher Jenks'. Then, we can request that in the plot:

In [167]:
base=worldBorders.plot(color='lightgrey',
                     figsize=(15, 10))
countriesCIA.plot(column='TobaccoUse_perc', cmap='viridis', ax=base,
                     scheme="fisher_jenks",
                     linewidth=0., 
                     legend=True,
                     legend_kwds={"title": "TOBACCO USE (%)\n(missing data in grey)"}
                   )
leg = base.get_legend()
leg.set_bbox_to_anchor((0.2, 0.6)) #(left, bottom)
base.set_axis_off()
plt.show();
/Users/JoseManuel/opt/anaconda3/envs/CSS_meth_3_13/lib/python3.13/site-packages/geopandas/plotting.py:752: UserWarning: Numba not installed. Using slow pure python version.
  binning = mapclassify.classify(
No description has been provided for this image

Above you see the bins limits that Fisher Jenks gave you.

In [168]:
# the code assigned to each country
fj5.yb
Out[168]:
array([2, 2, 2, 2, 2, 3, 2, 1, 2, 1, 0, 2, 2, 4, 0, 4, 1, 0, 3, 4, 1, 4,
       1, 1, 1, 2, 0, 2, 2, 3, 0, 2, 0, 0, 1, 4, 1, 0, 1, 2, 2, 3, 0, 0,
       0, 2, 2, 3, 3, 0, 2, 3, 0, 1, 0, 4, 3, 1, 3, 1, 2, 2, 1, 2, 0, 4,
       1, 2, 4, 1, 2, 3, 4, 4, 3, 0, 3, 2, 2, 3, 3, 3, 0, 0, 1, 2, 0, 2,
       0, 2, 3, 1, 2, 1, 0, 1, 0, 2, 1, 3, 4, 1, 1, 0, 2, 2, 0, 2, 4, 0,
       1, 3, 2, 3, 1, 1, 2, 2, 0, 2, 1, 1, 3, 4, 2, 2, 2, 3, 0, 0, 2, 4,
       0, 0, 0, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 4])
In [169]:
# the frequency table with the breaks
fj5.table
Out[169]:
<bound method MapClassifier.table of FisherJenks

   Interval      Count
----------------------
[ 3.50, 10.90] |    31
(10.90, 18.50] |    34
(18.50, 26.40] |    46
(26.40, 33.50] |    21
(33.50, 48.50] |    15>

Recoding and labelling¶

We can use that information to create an ordinal variable:

In [170]:
# new variable
countriesCIA['tobacco_code']=fj5.yb

# create 'tobacco_levels' as a copy of 'tobacco_code'
countriesCIA=countriesCIA.assign(tobacco_levels=countriesCIA.tobacco_code)

Notice that I have two new columns with the same information.

Now, let's give labels to one of them:

In [171]:
# map of labels for the levels
newLevels={0:'1.very low',1:'2.low', 2:'3.average',3:'4.high', 4:'5.very high'}

# recoding
countriesCIA.replace({'tobacco_levels':newLevels}, inplace=True)

This is our new plot using the diverging palette PiYG (I added _r to reverse the color order):

In [172]:
base=worldBorders.plot(color='lightgrey',
                     figsize=(15, 10))
countriesCIA.plot(column='tobacco_levels',ax=base,
                     cmap='PiYG_r', # reversed
                     categorical=True,
                     linewidth=0., 
                     legend=True,
                     legend_kwds={"title": "TOBACCO USE (%)\n(missing data in grey)"}
                   )
base.set_axis_off()

# locating the legend
leg = base.get_legend()
leg.set_bbox_to_anchor((0.2, 0.6)) #(left, bottom), 1 is max value (right/top)

plt.show();
No description has been provided for this image

I will share the R version for this map.

Custom bins¶

Remember that we should always consider custom made bins, instead of "automatically" detected ones. This is important when some thresholds are institutionalised. The process is simple:

  • Check values:
In [173]:
countriesCIA.TobaccoUse_perc.describe()
Out[173]:
count    147.000000
mean      20.278231
std        9.723848
min        3.500000
25%       11.900000
50%       20.800000
75%       26.100000
max       48.500000
Name: TobaccoUse_perc, dtype: float64
  • Decide breaks:
In [174]:
customBreaks=[0,5,15,30,40,100]
pd.cut(countriesCIA.TobaccoUse_perc, customBreaks,ordered=True,include_lowest=True).value_counts(sort=False)
Out[174]:
TobaccoUse_perc
(-0.001, 5.0]     3
(5.0, 15.0]      51
(15.0, 30.0]     66
(30.0, 40.0]     25
(40.0, 100.0]     2
Name: count, dtype: int64
  • Save the new column with labels:
In [175]:
theCustomLabels=['1. below5','2.(5-15]','3.(15-30]','4. (30-40]' , '5. above40']
countriesCIA['tobacco_custom']=pd.cut(countriesCIA.TobaccoUse_perc, 
                                         customBreaks,
                                         ordered=True,
                                         include_lowest=True,
                                         labels=theCustomLabels)
countriesCIA.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 147 entries, 0 to 146
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   COUNTRY               147 non-null    object  
 1   geometry              147 non-null    geometry
 2   name                  147 non-null    object  
 3   region                147 non-null    object  
 4   obesityAdults_rate    147 non-null    float64 
 5   TobaccoUse_perc       147 non-null    float64 
 6   Alcohol_LitersPerCap  147 non-null    float64 
 7   tobacco_code          147 non-null    int64   
 8   tobacco_levels        147 non-null    object  
 9   tobacco_custom        147 non-null    category
dtypes: category(1), float64(3), geometry(1), int64(1), object(4)
memory usage: 10.8+ KB

Interactive version¶

Interactive choropleths are important not because they may seem impressive, but because you need them to help you inform values as you visit a polygon (tooltips), and if you need to zoom in and out constantly.

An interactive version follows:

In [176]:
countriesCIA.explore(
    column="tobacco_levels",  # make choropleth based on "fisherJenks_levels" column
    tooltip=["name","TobaccoUse_perc"],  # what to show on hover
    tiles="CartoDB positron",  # use "CartoDB positron" tiles
    cmap="PiYG_r",  # colormap
    style_kwds=dict(color="black"),  # use black outline
    legend_kwds={'caption':'Tobacco Use<br>(missing has no tooltip)'}
)
Out[176]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Behind the scenes, geopandas is using folium. The code I will share in R will use leaflet for interactive features.

Creating layers from the discrete values¶

Let me subset the map with CIA data:

In [177]:
countriesCIA_good=countriesCIA[countriesCIA.tobacco_code==0]
countriesCIA_mean=countriesCIA[countriesCIA.tobacco_code==2]

Now, plot them in one map:

In [178]:
fig, axes = plt.subplots(nrows=1,ncols=2,figsize=(15, 10))
# main title
fig.suptitle('Tobacco Use', size=16,y=0.7)

# map1 
base1=worldBorders.plot(color='lightgrey',ax=axes[0])
countriesCIA_good.plot(color='green',ax=base1)
base1.set_title('The countries with lowest use')
base1.set_axis_off()

#map2
base2=worldBorders.plot(color='lightgrey',ax=axes[1])
countriesCIA_mean.plot(color='orange',ax=base2)
base2.set_title('The countries with average use')
base2.set_axis_off()


#display
plt.show();
No description has been provided for this image

I will share the R code for this last map in R, using facets.

Let me use ONE map to represent the previous one (with two sub maps)in an interactive way:

In [179]:
import folium # I need folium

base = countriesCIA_good.explore(
    color='green',
    tooltip=["name","TobaccoUse_perc"],  # what to show on hover
    name="good",  # name of the layer in the map
)

countriesCIA_mean.explore(
    m=base,  # the 'ax'
    tooltip=["name","TobaccoUse_perc"],  # what to show on hover
    color="orange",
    name="average",  
)


folium.LayerControl().add_to(base)  # use folium to add layer control

base  # show map
Out[179]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Currently we have:

In [180]:
countriesCIA.head()
Out[180]:
COUNTRY geometry name region obesityAdults_rate TobaccoUse_perc Alcohol_LitersPerCap tobacco_code tobacco_levels tobacco_custom
0 Afghanistan MULTIPOLYGON (((61.27656 35.60725, 61.29638 35... Afghanistan South Asia 5.5 23.3 0.01 2 3.average 3.(15-30]
1 Algeria MULTIPOLYGON (((-5.15213 30.18047, -5.13917 30... Algeria Africa 27.4 21.0 0.59 2 3.average 3.(15-30]
2 Azerbaijan MULTIPOLYGON (((46.54037 38.87559, 46.49554 38... Azerbaijan Middle East 19.9 24.0 1.38 2 3.average 3.(15-30]
3 Albania MULTIPOLYGON (((20.79192 40.43154, 20.78722 40... Albania Europe 21.7 22.4 4.40 2 3.average 3.(15-30]
4 Armenia MULTIPOLYGON (((46.54037 38.87559, 46.51639 38... Armenia Middle East 20.2 25.5 3.77 2 3.average 3.(15-30]

Beyond choroplething¶

At every one of the previous stages you may need to make several assumptions and use a particular technique. At the end, you do all this work not to color a map in a nice way, but to give some interesting information that may start a decision making process. Most of the time, these requires combining several variables to say something relevant.

Let's prepare the bins for the Alcohol data, and then propose something interesting.

In [181]:
# based on similarity, optimizer
fj5_alc = mapclassify.FisherJenks(countriesCIA.Alcohol_LitersPerCap, k=K)

# get the values
countriesCIA['alcohol_code']=fj5_alc.yb

# a copy in another column
countriesCIA=countriesCIA.assign(alcohol_levels=countriesCIA.alcohol_code)

# recoding
countriesCIA.replace({'alcohol_levels':newLevels}, inplace=True)

countriesCIA.head()
/Users/JoseManuel/opt/anaconda3/envs/CSS_meth_3_13/lib/python3.13/site-packages/IPython/core/interactiveshell.py:3577: UserWarning: Numba not installed. Using slow pure python version.
  exec(code_obj, self.user_global_ns, self.user_ns)
Out[181]:
COUNTRY geometry name region obesityAdults_rate TobaccoUse_perc Alcohol_LitersPerCap tobacco_code tobacco_levels tobacco_custom alcohol_code alcohol_levels
0 Afghanistan MULTIPOLYGON (((61.27656 35.60725, 61.29638 35... Afghanistan South Asia 5.5 23.3 0.01 2 3.average 3.(15-30] 0 1.very low
1 Algeria MULTIPOLYGON (((-5.15213 30.18047, -5.13917 30... Algeria Africa 27.4 21.0 0.59 2 3.average 3.(15-30] 0 1.very low
2 Azerbaijan MULTIPOLYGON (((46.54037 38.87559, 46.49554 38... Azerbaijan Middle East 19.9 24.0 1.38 2 3.average 3.(15-30] 0 1.very low
3 Albania MULTIPOLYGON (((20.79192 40.43154, 20.78722 40... Albania Europe 21.7 22.4 4.40 2 3.average 3.(15-30] 2 3.average
4 Armenia MULTIPOLYGON (((46.54037 38.87559, 46.51639 38... Armenia Middle East 20.2 25.5 3.77 2 3.average 3.(15-30] 1 2.low

This is the distribution:

In [182]:
countriesCIA.alcohol_levels.value_counts()
Out[182]:
alcohol_levels
1.very low     45
2.low          31
5.very high    26
4.high         24
3.average      21
Name: count, dtype: int64

Now that we have two variables, we could ask ourselves what countries are high (level 4) in both tobacco and alcohol consumption?

In [183]:
interestingCountries=countriesCIA[countriesCIA.tobacco_code + countriesCIA.alcohol_code==8]
## or
# countriesCIA[(countriesCIA.tobacco_code==4) & (countriesCIA.alcohol_code==4)]

interestingCountries
Out[183]:
COUNTRY geometry name region obesityAdults_rate TobaccoUse_perc Alcohol_LitersPerCap tobacco_code tobacco_levels tobacco_custom alcohol_code alcohol_levels
21 Bulgaria MULTIPOLYGON (((28.01305 41.98222, 27.9711 41.... Bulgaria Europe 25.0 39.0 11.18 4 5.very high 4. (30-40] 4 5.very high
35 Cyprus MULTIPOLYGON (((33.27229 34.70955, 33.21722 34... Cyprus Europe 21.8 35.1 9.59 4 5.very high 4. (30-40] 4 5.very high
55 Croatia MULTIPOLYGON (((19.03972 44.86138, 19.02972 44... Croatia Europe 24.4 36.9 9.64 4 5.very high 4. (30-40] 4 5.very high
73 Latvia MULTIPOLYGON (((21.06861 56.43555, 21.05777 56... Latvia Europe 23.6 37.0 12.90 4 5.very high 4. (30-40] 4 5.very high

These are the countries:

In [184]:
base=worldBorders.plot(color='lightgrey')
interestingCountries.plot(ax=base)
Out[184]:
<Axes: >
No description has been provided for this image

Here we need to alter the basemap for an area like this:

In [185]:
countriesCIA[countriesCIA.tobacco_code + countriesCIA.alcohol_code==8].plot()
Out[185]:
<Axes: >
No description has been provided for this image

All those polygons are within a rectangle, let me find it:

In [186]:
maskToClip=interestingCountries.dissolve().envelope
maskToClip.plot(color='white',edgecolor='red')
Out[186]:
<Axes: >
No description has been provided for this image

I may use this rectangle to clip the world!

In [187]:
# new map
miniWorld=worldBorders.clip(maskToClip)
# then
base=miniWorld.plot(color='lightgrey')

interestingCountries.plot(ax=base)
plt.show()
No description has been provided for this image

It would be better if we can see the names. Here comes some labor:

  1. Get the coordinates for the country names
In [188]:
allCoords=[x.coords[:][0] for x in interestingCountries.representative_point()]
allCoords
Out[188]:
[(25.170339327283575, 42.74881362915039),
 (33.39597016804631, 35.166385650634766),
 (15.35048927473903, 44.74440383911133),
 (24.453569548787208, 56.89241981506348)]
  1. Use those coordinates to create a new column:
In [189]:
interestingCountries=interestingCountries.assign(coordinates=allCoords)
interestingCountries
Out[189]:
COUNTRY geometry name region obesityAdults_rate TobaccoUse_perc Alcohol_LitersPerCap tobacco_code tobacco_levels tobacco_custom alcohol_code alcohol_levels coordinates
21 Bulgaria MULTIPOLYGON (((28.01305 41.98222, 27.9711 41.... Bulgaria Europe 25.0 39.0 11.18 4 5.very high 4. (30-40] 4 5.very high (25.170339327283575, 42.74881362915039)
35 Cyprus MULTIPOLYGON (((33.27229 34.70955, 33.21722 34... Cyprus Europe 21.8 35.1 9.59 4 5.very high 4. (30-40] 4 5.very high (33.39597016804631, 35.166385650634766)
55 Croatia MULTIPOLYGON (((19.03972 44.86138, 19.02972 44... Croatia Europe 24.4 36.9 9.64 4 5.very high 4. (30-40] 4 5.very high (15.35048927473903, 44.74440383911133)
73 Latvia MULTIPOLYGON (((21.06861 56.43555, 21.05777 56... Latvia Europe 23.6 37.0 12.90 4 5.very high 4. (30-40] 4 5.very high (24.453569548787208, 56.89241981506348)

Let's use that info:

In [190]:
base=miniWorld.plot(color='lightgrey')
interestingCountries.plot(color='yellow', edgecolor='black',ax=base)
for idx, row in interestingCountries.iterrows():
   plt.annotate(text=row['name'], xy=row['coordinates'], horizontalalignment='center', color='blue')
No description has been provided for this image

You will find the code in R for this one too.

The interactive version does not need the clipped map:

In [191]:
interestingCountries.explore()
Out[191]:
Make this Notebook Trusted to load map: File -> Trust Notebook

From here, you can start hypothesing some ideas where location matters (why are you usig a map otherwise!)

Let me save this information in a geojson file:

In [192]:
countriesCIA.to_file('maps/countriesCIA.gpkg', driver='GPKG', layer='cia') 
worldBorders.to_file('maps/countriesCIA.gpkg', driver='GPKG', layer='border') 

HOMEWORK #1¶

This homework requires a repo on GitHub, and you use Python and R.

A. In Python:

1. Work with the column on obesity, and find the best scheme for 5  bins.
2. Use that same scheme to bin the columns on _alcohol_ and _tobacco_.
3. Find the countries that are doing the best in all three variables.
 

B. In R or Python:

1. Plot the countries you found in the last step. YOU ONLY PLOT the NON-interactive version.

The code for the first three steps has to be in one file named code_hw1. The code to plot the countries in part B has to be in another file. All the data has to be read from the PATH from GitHub. For the section B, you create a file named index, and you make sure that you create an html version with the same name. The result of section B will be published as a web page from GitHub.